Abstract

Many scientific applications involve grids that lack a uniform underlying structure. These applications
are often also dynamic in nature in that the grid structure significantly changes between successive
phases of execution. In parallel computing environments, mesh adaptation of unstructured grids through
selective refinement/coarsening has proven to be an effective approach. However, achieving load balance
while minimizing interprocessor communication and redistribution costs is a difficult problem.
Traditional dynamic load balancers are mostly inadequate because they lack a global view of system loads
across processors. In this paper, we propose a novel and general-purpose load balancer that utilizes
symmetric broadcast networks (SBN) as the underlying communication topology, and compare its
performance with a successful global load balancing environment, called PLUM, specifically created to
handle adaptive unstructured applications. Our experimental results on an IBM SP2 demonstrate that
the SBN-based load balancer achieves lower redistribution costs than that under PLUM by overlapping
processing and data migration.

Key words: Dynamic load balancing, experimental study, IBM SP2, job migration and redistribution, sym- 
metric broadcast networks, unstructured mesh adaptation 


1 Introduction 

Mesh partitioning is a common approach to parallelize many scientific applications that are
modeled discretely using a mesh (or grid) of vertices and edges. For maximum efficiency, the computational
workloads on the processors have to be balanced and the number of edges that are cut (and hence the overall
interprocessor communication cost at runtime) needs to be minimized. For this reason, each vertex is
usually assigned a weight that indicates the amount of computation required to process it. Similarly, each
edge in the mesh has an associated weight indicating the amount of interaction between adjacent vertices.
To achieve load balance dynamically, portions of the mesh have to be migrated among processors during
the course of a computation. Thus, in a multiprocessing environment, the vertex weight contains an additional
component that models the cost of redistributing the vertex from one processor to another. These weights
are used to minimize the data redistribution cost during the remapping phase.

With adaptive meshes, the grid topology changes during the course of a computation. Traditionally, this
class of problems is processed by load balancing the mesh after each adaptation. A number of partitioners
designed for this purpose have been proposed in the literature [8, 11, 14, 17, 21]. A majority of the
successful partitioners are based on a multilevel approach that has proven to be extremely effective in
producing good partitions at reasonable cost. In a multilevel scheme, the grid is first contracted to a small
number of vertices and edges, the coarsened grid is next partitioned, and is then finally refined to the
original using the Kernighan-Lin refinement algorithm [12]. However, other partitioning methods have also
been developed, and excellent surveys are provided in [1, 19].
Although several dynamic load balancers have been proposed for multiprocessor platforms [3, 9, 13, 19,
20], most of them are inadequate for adaptive mesh applications because they lack a global view of system
loads across processors. Furthermore, job migration in such approaches does not take into account the
structure of the adaptive grid. This motivates our present work. In this paper, we overcome these deficiencies
by proposing a novel, dynamic load balancer which makes use of a symmetric broadcast network (SBN) as
a robust and topology-independent communication pattern among processors [6]. Section 2 describes this
SBN-based load balancing algorithm. Our earlier experiments with synthetic loads [5] have demonstrated
that such an SBN strategy achieves superior performance when compared to other popular techniques such
as Random, Gradient, Receiver Initiated, Sender Initiated, and Adaptive Contracting.

The SBN-based load balancing algorithm provides an architecture-independent solution in that it
generates portable codes which can be run without modification on any parallel/distributed platform. This is
because typical communication patterns such as mesh, hypercube, tree, and torus can be embedded
efficiently within the SBN topology. It is true that the proposed load balancing scheme in its current form may
not be optimal for a given architecture; however, it can be made so by fine tuning the algorithm and properly
mapping it on the machine by utilizing its hardware specifications.

Recently, experiments that measure the effectiveness of load balancing adaptive meshes have been pre- 
sented in [2, 16] using an automatic portable environment, called PLUM [15], developed at NASA Ames 
Research Center. PLUM uses a novel strategy for load balancing which consists of two separate phases: 
repartitioning and remapping. A brief overview of PLUM, and a description of its salient differences with 
the SBN-based load balancer are given in Section 3.

We have conducted several experiments on an IBM SP2 to compare the performance of the SBN-based
load balancer to that of PLUM. The results, presented in Section 4, demonstrate that the SBN-based
algorithm achieves excellent load balance, and that the redistribution cost is significantly lower than that
obtained under PLUM when using two state-of-the-art partitioners, PMeTiS [11] and DMeTiS [17].
However, the edge cut percentages are higher than those for PMeTiS, indicating that the SBN strategy reduces the
redistribution cost at the expense of greater communication. In many adaptive mesh applications where the
data redistribution cost dominates the processing and communication cost [15, 16, 18], this is an acceptable
trade-off.

2 SBN-Based Load Balancer 

Our proposed SBN-based load balancer, targeted for adaptive mesh computations, can be classified as:
(i) adaptive, since processing automatically adjusts to the allocated workload; (ii) decentralized, since load
balancing can be initiated by any processor in the system and is shared by all; (iii) stable, since excessive
load balancing traffic does not burden the network; and (iv) effective, since system performance does not
degrade due to load balancing activities. In this section, we give the definition of an SBN, and present
the SBN-based load balancing algorithm. We also describe a pre-partitioner that can optionally be used to
assign subdomains to the individual processors before each adaptation step.

2.1 SBN Definition 

A symmetric broadcast network (SBN), first presented in [6], defines a (logical or physical) communication
pattern among the P processors in a multicomputer system. It is defined as follows.

Definition 1 An SBN(d) of dimension $d \ge 0$ is a $(d + 1)$-stage interconnection network with $P = 2^d$
processors in each stage, and can be constructed recursively. A single processor forms the basis network
SBN(0). For $d > 0$, SBN(d) is obtained from a pair of SBN(d - 1)s by (i) relabeling the processors in the
second SBN(d - 1) from $2^{d-1}$ to $2^d - 1$; (ii) incrementing the identifiers of the existing stages by one and
creating a new stage 0 containing processors 0 to $2^d - 1$; (iii) connecting processor i in stage 0 to processor
$j = (i + P/2) \bmod P$ of stage 1; and (iv) connecting processor j in stage 1 to the processor in stage 2 (if
present) which was the stage 0 successor of processor i in SBN(d - 1).

Fig. 1(a) illustrates how an SBN(2) is recursively constructed from two SBN(1)s, while Fig. 1(b) shows
the construction of an SBN(3) from two SBN(2)s.



Figure 1: (a) Construction of SBN(2) from a pair of SBN(1)s, and (b) SBN(3) from a pair of SBN(2)s. The
new connections are shown by solid lines and the original connections by dashed lines.


Note that an SBN(d) defines unique communication patterns (or broadcast trees) among the processors
in the network. In other words, for any root processor x at stage 0, where $0 \le x < P$, there exists a unique
broadcast tree $T_x$ of height $d = \log P$ such that each of the $2^d$ processors appears exactly once. Furthermore,
the SBN communication pattern for x can be derived from the template broadcast tree with processor 0 as
the source [5]. The predecessor and successors of each processor are also uniquely defined by specifying
the root and the communication stage. Finally, SBN communication patterns can be efficiently embedded
into different parallel architectures in a topology-independent manner [4, 7].
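
The recursive construction in Definition 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `build_sbn` and `broadcast_tree`, and the choice to store the network as one successor table per stage, are assumptions made for this example.

```python
def build_sbn(d):
    """Return the links of SBN(d) as a list of d dicts.

    links[s][i] lists the stage s+1 successors of processor i in stage s.
    (Hypothetical representation chosen for this sketch.)
    """
    if d == 0:
        return []                      # SBN(0): one processor, no links
    prev = build_sbn(d - 1)
    half, P = 2 ** (d - 1), 2 ** d
    links = [dict() for _ in range(d)]
    # Steps (i) and (ii): two copies of SBN(d-1), the second relabeled by
    # +half, with every existing stage shifted up by one.
    for s, stage in enumerate(prev):
        for i, succs in stage.items():
            links[s + 1][i] = list(succs)
            links[s + 1][i + half] = [t + half for t in succs]
    for i in range(P):
        j = (i + half) % P
        links[0][i] = [j]              # step (iii)
        if d >= 2:                     # step (iv): only if stage 2 exists
            offset = half if i >= half else 0
            for t in prev[0][i - offset]:
                links[1][j].append(t + offset)
    return links


def broadcast_tree(links, root):
    """Return the processors reached at each stage when root broadcasts."""
    levels = [[root]]
    for stage in links:
        levels.append([t for u in levels[-1] for t in stage[u]])
    return levels
```

Under these assumptions, `broadcast_tree(build_sbn(2), 1)` returns `[[1], [3], [2, 0]]`: processor 1 first contacts processor 3, which then forwards to processors 2 and 0, so every processor appears exactly once in a tree of height 2.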


2.2 Proposed Load Balancing Algorithm 


Our SBN-based load balancer adapts its behavior according to the system load. Under heavy (light) load,
the balancing activity is primarily initiated by processors that are lightly (heavily) loaded, and is controlled
by two system load thresholds, MinTh and MaxTh. Performance is influenced by the choice of values for
MinTh and MaxTh. If MinTh is too small, a processor could become idle before receiving additional jobs
for processing. On the other hand, a large value of MinTh could trigger unnecessary balancing activity.
Similarly, if MaxTh is too small, an excessive number of jobs will be migrated; if too large, jobs will not be
adequately migrated under light system loads. Moreover, once there is sufficient load in the system, very
little load balancing activity should be required.

The load balancer processes two types of messages: (i) load balancing messages and (ii) job distribution
messages. A load balancing message is broadcast when a processor p determines that its weighted queue
length QWgt(p) < MinTh. Such messages are also broadcast if QWgt(p) > MaxTh, or if distribution of excess
jobs causes other processors to exceed MaxTh. As the load balancing message passes from one processor to
another, the average weighted system load, WSysLL, is computed.
Job distribution messages are used to distribute jobs when QWgt(p) > MaxTh. They are also used to
complete the load balancing process. After the WSysLL value is calculated, a distribution message is
broadcast through the SBN so that jobs are routed to lightly-loaded processors and the system control variables
(MinTh, MaxTh, and WSysLL) can be globally updated. As a result, all processor workloads are balanced. To
reduce message traffic, a processor does not initiate additional load balancing activity until all the previous
messages that have passed through it have been completely processed.

Note that it is possible to encounter a situation when there are so many jobs in the system that at least one 
processor will have its MaxTh value exceeded. This would lead to thrashing, where jobs are unnecessarily 
routed back and forth among processors. To prevent this situation, if a processor at the last SBN stage 
determines that its MaxTh has been exceeded, it triggers a load balancing message instead of distributing the
excess load. As a result, WSysLL and MaxTh are globally recomputed. 

Let us now discuss the various parameters and implementation details involved in the SBN-based load 
balancer. These parameters are necessary to provide a global view of the system and make the SBN approach 
effective for adaptive mesh applications. 

2.2.1 Weighted Queue Length and System Load 

The queue length (computation time) of a processor p is not an accurate estimate of the amount of time
required to complete its work, particularly in applications where the mesh is adapted. To achieve a better
load balance, we define a new metric called weighted queue length, QWgt(p), that also considers the
communication and redistribution costs. Let $\mathrm{Wgt}^v$ be the computational cost to process a vertex v,
$\mathrm{Comm}_p^v$ be the communication cost to interact with the vertices adjacent to v but whose data sets are
not local to p, and $\mathrm{Remap}_p^v$ be the redistribution cost to copy the data set for v to p from another
processor. Then

$$\mathrm{QWgt}(p) = \sum_{v \text{ assigned to } p} \left( \mathrm{Wgt}^v + \mathrm{Comm}_p^v + \mathrm{Remap}_p^v \right).$$

Clearly, if the data set for v is already assigned to p, no redistribution cost is incurred, i.e., $\mathrm{Remap}_p^v = 0$.
Similarly, if the data sets of all the vertices adjacent to v are already assigned to p, there is no communication
cost, i.e., $\mathrm{Comm}_p^v = 0$.

The weighted system load, WSysLL, is computed as

$$\mathrm{WSysLL} = \frac{1}{P} \sum_{p=1}^{P} \mathrm{QWgt}(p),$$

where P is the total number of processors used.

2.2.2 Prioritized Vertex Selection

When selecting vertices to be processed, the SBN-based load balancer utilizes the underlying structure of the
adaptive mesh to defer execution of boundary vertices as long as possible since they could be migrated for
more efficient execution. Thus, selection of the queued vertex to be processed next is based on the goal that
the overall edge cut of the adapted mesh is minimized. A priority min-queue is maintained for this purpose,
where the priority of a vertex v in processor p is given by $(\mathrm{Comm}_p^v + \mathrm{Remap}_p^v) / \mathrm{Wgt}^v$. Therefore, vertices
with no communication and redistribution costs are processed first, while those with high communication
or redistribution overhead relative to their computational weight are executed last. Conceptually, internal
vertices are processed before those on partition boundaries.

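The weighted queue length, the weighted system load, and the priority ordering can be illustrated with a short Python sketch. The `Vertex` record and the function names are hypothetical, and the sketch assumes every vertex has a positive computational weight; it only mirrors the formulas above and is not the authors' code.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    vid: int
    wgt: float    # Wgt^v: computational cost (assumed > 0)
    comm: float   # Comm_p^v: communication cost on the current processor
    remap: float  # Remap_p^v: cost to bring the data set to this processor


def qwgt(vertices):
    """Weighted queue length of one processor."""
    return sum(v.wgt + v.comm + v.remap for v in vertices)


def wsysll(queues):
    """Average weighted load over all processors (queues: proc -> vertices)."""
    return sum(qwgt(vs) for vs in queues.values()) / len(queues)


def processing_order(vertices):
    """Vertex ids in the order a priority min-queue would release them."""
    heap = [((v.comm + v.remap) / v.wgt, v.vid) for v in vertices]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

For instance, with made-up vertices `Vertex(1, 2, 0, 0)`, `Vertex(2, 1, 3, 0)`, and `Vertex(3, 4, 2, 2)`, the priorities are 0, 3, and 1, so `processing_order` releases vertex 1 (an internal vertex with no overhead) first and vertex 2 last.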
2.2.3 Differential Edge Cut


To balance the system load among processors, an optimal policy for vertex migration needs to be established.
When vertices are being moved between processors, assume that processor p is about to reassign some of
its vertices to another processor q. The SBN-based load balancer running on p randomly picks a subset
of vertices from those queued locally. For the experiments reported in this paper, picking a subset of ten
vertices worked best. This random procedure reduces the vertex selection overhead since a sorted list of
vertices (by migration priority) does not have to be maintained. The motivation was not to find the absolute
best vertex to migrate, but rather to identify a vertex that would improve the edge cut as well as the load
balance when moved.

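For each selected vertex v, the differential edge cut¹, $\Delta\mathrm{Cut}(v)$, is calculated as

$$\Delta\mathrm{Cut}(v) = \mathrm{Remap}_q^v - \mathrm{Remap}_p^v + \mathrm{Comm}_q^v - \mathrm{Comm}_p^v.$$

If the data set for v resides on one of the two processors, the corresponding Remap term is 0 while the
other is nonzero.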

A negative $\Delta\mathrm{Cut}(v)$ value indicates a reduction in communication and redistribution costs if v is
migrated from p to q; hence, migration of vertices with the largest absolute reduction in these costs is favored.
Once the differential edge cut values are calculated for all the randomly chosen vertices, the vertex v' with
the smallest value is chosen for migration. Next, following a breadth-first search, the SBN load balancer
selects the vertices adjacent to v' that are also queued locally for processing on p. The breadth-first search
stops either when no adjacent vertices are queued for local processing at p, or if a sufficient number of
vertices have been found for migration. If more work needs to be transferred out of p, another subset of vertices
is randomly chosen and the procedure is repeated. This migration policy therefore strives to maintain or
improve the cut size during the execution of the load balancing algorithm.
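
The selection procedure just described (random sample, smallest differential edge cut, breadth-first expansion) can be sketched as follows. The function names, the `costs` layout, and the use of a vertex count as the stopping criterion are assumptions for illustration; the paper only says that the search stops once a sufficient number of vertices has been found.

```python
import random
from collections import deque


def delta_cut(remap_p, remap_q, comm_p, comm_q):
    """Differential edge cut of moving a vertex from p to q."""
    return remap_q - remap_p + comm_q - comm_p


def select_for_migration(queued, costs, adjacency, needed,
                         sample_size=10, rng=None):
    """Pick up to `needed` vertices queued on p to migrate to q.

    queued:    non-empty set of vertex ids queued locally on p
    costs:     vertex id -> (remap_p, remap_q, comm_p, comm_q)
    adjacency: vertex id -> iterable of neighbouring vertex ids
    """
    rng = rng or random.Random()
    pool = sorted(queued)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    # Seed: sampled vertex with the smallest differential edge cut.
    seed = min(sample, key=lambda v: (delta_cut(*costs[v]), v))
    chosen, seen, frontier = [seed], {seed}, deque([seed])
    # Breadth-first search over neighbours that are also queued on p.
    while frontier and len(chosen) < needed:
        u = frontier.popleft()
        for w in adjacency.get(u, ()):
            if len(chosen) >= needed:
                break
            if w in queued and w not in seen:
                seen.add(w)
                chosen.append(w)
                frontier.append(w)
    return chosen
```

Sampling avoids keeping the queue sorted by migration priority, which is the overhead reduction the text points to; the trade-off is that the seed is only the best of the sample, not of the whole queue.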

2.2.4 Data Redistribution Policy 

The redistribution of data is performed in a lazy manner. In other words, the data set for a vertex v in
processor p is not moved to another processor q until the latter is about to execute v (q notifies p when this
happens). Furthermore, the data sets of all vertices adjacent to v that are also assigned to q are migrated
with the data set of v. This policy greatly reduces the redistribution and communication costs by avoiding
multiple data migrations, and having resident on q all adjacent vertices of v while v is being processed by q.

Data migration is implemented by broadcasting a job distribution message when a vertex is about to
be processed and its corresponding data set is not resident on the local processor. A locate-message is
then broadcast to indicate the new location of the data set, so that all processors can update their records.
This policy is expected to maximize the number of adjacent vertices that are local when a given vertex is
processed. Hence, by considering the underlying mesh structure, the communication overhead is reduced.
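
The lazy policy can be illustrated with a small Python sketch that separates where a vertex will execute from where its data currently resides. The dictionaries `assigned` and `location` and the function name are hypothetical bookkeeping for this example; the message exchange between processors is not modeled.

```python
def migrate_on_execute(v, q, assigned, location, adjacency):
    """Move data lazily when processor q is about to execute vertex v.

    assigned[u] is the processor that will execute u; location[u] is the
    processor currently holding the data set of u. Returns the vertices
    whose data sets were moved to q.
    """
    # v itself, plus every neighbour of v that is also assigned to q.
    group = [v] + [u for u in adjacency.get(v, ()) if assigned[u] == q]
    moved = [u for u in group if location[u] != q]
    for u in moved:
        location[u] = q   # data moves only now, just before execution
    return moved
```

Because neighbours assigned to q travel together with v, they are already local when their own turn comes, so a later call for one of them moves nothing.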

2.3 An Illustrative Example 

Fig. 2 illustrates the SBN-based load balancer just described. It shows a mesh of 16 vertices and 20 edges
that is partitioned among four processors, P0 through P3. For each vertex, the processing and redistribution
costs are represented as a two-tuple. Adjacent vertices are connected by edges which are labeled with the
associated communication cost, provided the data sets for the two vertices reside on different processors
when either one is processed.

¹ We are deviating from the usual definition of edge cut to account for the dynamic nature of the SBN load balancer.
Figure 2: An example to illustrate the SBN-based load balancer.


Table 1 shows the $\mathrm{Wgt}^v$, $\mathrm{Comm}_p^v$, and $\mathrm{Remap}_p^v$ values for each vertex v, under the current
vertex-to-processor assignment. We assume that the data for vertex 7 is resident on P1, the data for vertex 10 is on
P2, while the data for vertices 9, 11, and 16 are resident on P0. The data sets for the remaining vertices
reside on the corresponding processor to which they are assigned. Table 1 also shows the QWgt(p) values
for each processor p, as defined in Section 2.2.1. The weighted system load, WSysLL, for this example is 24.


Table 1: Various Costs for Each Vertex v, and the Weighted Queue Length for Each Processor p

processor p           |       P0       | P1 |       P2       |             P3
vertex v              | 10  13  14  15 |  2 |  1   5   6   9 |  3   4   7   8  11  12  16
$\mathrm{Wgt}^v$      |  2   1   1   2 |  1 |  1   1   3   1 |  2   1   4   1   3   1   1
$\mathrm{Comm}_p^v$   |  0   3   5   2 |  2 |  6   3   6   5 |  3   0  12   0   3   3   3
$\mathrm{Remap}_p^v$  |  1   0   0   0 |  0 |  0   0   0   2 |  0   0   7   0   1   0   1
QWgt(p)               |       17       |  3 |       28       |             46


If we assume that MinTh = 10, processor P1 is clearly underloaded. According to the SBN
communication pattern shown in Fig. 1(a), P1 sends a load balancing request to P3. Upon receiving it, P3 determines
which of its vertices to transfer to P1 so that their loads will be equidistributed. Let us step through the
process of selecting the first vertex to migrate, using the differential edge cut described in Section 2.2.3. The
$\Delta\mathrm{Cut}(v)$ values of the vertices v currently assigned to P3 are shown in Table 2. Vertex 7 is found to be
optimal for migration to P1, yielding QWgt(P3) = 23 and QWgt(P1) = 18. The new value of WSysLL is
22, which reflects a reduction in the total system load. For this example, additional vertex migration is not
required.


Table 2: Differential Edge Cut for Each Vertex v in Processor P3 if Migrated to P1

vertex v                   |  3   4   7   8  11  12  16
$\mathrm{Remap}_{P1}^v$    |  1   2   0   3   1   1   1
$\mathrm{Remap}_{P3}^v$    |  0   0   7   0   1   0   1
$\mathrm{Comm}_{P1}^v$     |  6   1  11   6   3   7   1
$\mathrm{Comm}_{P3}^v$     |  3   0  12   0   3   3   3
$\Delta\mathrm{Cut}(v)$    |  4   3  -8   9   0   5  -2
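
The arithmetic of this example can be checked with a short script. The numbers below are transcribed from Tables 1 and 2; the variable names are hypothetical. The script computes the plain average load (23.5 before and 21.5 after the migration), whereas the text reports 24 and 22, which would be consistent with rounding up; that rounding convention is an assumption, not something the text states.

```python
# vertex -> (Wgt, Comm, Remap) under the current assignment (Table 1)
table1 = {
    "P0": {10: (2, 0, 1), 13: (1, 3, 0), 14: (1, 5, 0), 15: (2, 2, 0)},
    "P1": {2: (1, 2, 0)},
    "P2": {1: (1, 6, 0), 5: (1, 3, 0), 6: (3, 6, 0), 9: (1, 5, 2)},
    "P3": {3: (2, 3, 0), 4: (1, 0, 0), 7: (4, 12, 7), 8: (1, 0, 0),
           11: (3, 3, 1), 12: (1, 3, 0), 16: (1, 3, 1)},
}
# vertex -> (Remap on P1, Comm on P1) if migrated from P3 to P1 (Table 2)
on_p1 = {3: (1, 6), 4: (2, 1), 7: (0, 11), 8: (3, 6),
         11: (1, 3), 12: (1, 7), 16: (1, 1)}


def qwgt(costs):
    """Weighted queue length: sum of Wgt + Comm + Remap over the vertices."""
    return sum(w + c + r for w, c, r in costs.values())


loads_before = {p: qwgt(vs) for p, vs in table1.items()}
mean_before = sum(loads_before.values()) / len(loads_before)

# Differential edge cut for each vertex of P3 if moved to P1.
delta = {v: on_p1[v][0] - r + on_p1[v][1] - c
         for v, (w, c, r) in table1["P3"].items()}
best = min(delta, key=delta.get)

# Move the best vertex: remove its P3 costs, add its costs as seen on P1.
w, c, r = table1["P3"][best]
loads_after = dict(loads_before)
loads_after["P3"] -= w + c + r
loads_after["P1"] += w + on_p1[best][1] + on_p1[best][0]
mean_after = sum(loads_after.values()) / len(loads_after)

print(loads_before, mean_before)
print(delta, best)
print(loads_after, mean_after)
```

The computed queue lengths (17, 3, 28, 46), the differential edge cuts, the choice of vertex 7, and the updated loads QWgt(P3) = 23 and QWgt(P1) = 18 all agree with the values given in the text.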


2.4 SBN Pre-Partitioner 

The SBN-based load balancing algorithm is designed to run dynamically without the need for a separate
partitioning process. This is a significant advantage over existing approaches where processing is
temporarily suspended when processor loads become unbalanced. During the suspension, vertices are reassigned and
the corresponding data sets are remapped. The asynchronous nature of the SBN strategy also allows the
computational, communication, and redistribution phases to be overlapped, leading to further reductions in
the overall execution time. Traditional methods [15, 19, 20] cannot achieve this overlap easily because these
phases are processed sequentially.

To test the behavior of the SBN technique, we implemented a pre-partitioner which can optionally run
prior to each mesh adaptation phase. We wanted to determine whether running a front-end partitioner has
any significant benefit on the resulting communication and/or redistribution overhead. This pre-partitioner
is unique in that it partitions based on QWgt(p) values, which take into consideration all three factors of
computation, communication, and redistribution. This is a stronger requirement than that considered in
almost all other approaches [8, 11, 17, 21], where the mesh is partitioned to equalize the total computational
cost while minimizing the total number of cut edges. Such methods could result in significant idle time
during processing if only a few processors incurred most of the communication overhead.

The pre-partitioner differs from the partitioning capabilities inherent in the SBN-based load balancer in
that multiple iterations are performed to find an optimal P-way partition. Here, an iteration is defined as
a sequence of vertex reassignments from one processor to another. During an iteration, each vertex can be
reassigned at most once. Reassignments are made so that vertices in processor p with QWgt(p) > WSysLL
are assigned to the processor q with the minimum QWgt(q) value. Each vertex to be reassigned is adjacent
to a random subset of vertices chosen from and belonging to q. First, the $\Delta\mathrm{Cut}(v)$ values are computed
for all adjacent vertices v assigned to processors other than q. As described in Section 2.2.3, the nonlocal
adjacent vertex v' with the smallest $\Delta\mathrm{Cut}(v')$ is added to the set of vertices assigned to q. In addition, a
breadth-first search is performed on the vertices adjacent to v' that are not assigned to q but to p such that
QWgt(p) > WSysLL. These vertices are also assigned to q.

The pre-partitioner is initially set to execute a fixed number of iterations (four for the experiments in
this paper). However, additional iterations are performed if a new minimum WSysLL is achieved. At the
end of each iteration, the load imbalance factor QWgt(r)/WSysLL for the processor r with the largest value
of QWgt(r) is computed. If this factor is greater than a specified threshold (1.75 in our experiments), the
Kernighan-Lin refinement procedure [12] is invoked to further reduce WSysLL. Note that the data
associated with each vertex is not migrated after the pre-partitioning process is completed. Instead, actual data
movement takes place during mesh adaptation as vertices are processed with SBN load balancing in effect.


3 PLUM Framework 

We experimentally compare the performance of our SBN-based load balancer with PLUM [15], a portable
and parallel load balancing framework for adaptive unstructured grids. In PLUM, when processor workloads
become unbalanced due to adaptation, the mesh is repartitioned and the subgrids reassigned to the
processors. If the estimated remapping cost exceeds the expected computational gain, execution continues without
remapping. Otherwise, the grid is remapped among the processors before the computation is resumed. For
the sake of completeness, a brief description of the important features of PLUM is given below.
3.1 Reusing the Initial Graph

PLUM repeatedly utilizes the initial mesh for the purpose of load balancing. The computational weight,
$\mathrm{Wgt}^v$, of a vertex v in the corresponding dual graph, is the number of leaf elements in the refinement tree
because only those elements with no children participate in the numerical computation. The redistribution
cost, $\mathrm{Remap}^v$, is the total number of elements in the refinement tree because all descendants of the root
element must be moved from one partition to another when the load is to be rebalanced. Lastly, the
communication cost, $\mathrm{Comm}_e$, of a dual graph edge e, is set to the number of corresponding faces in the computational
mesh. These weights are used to determine an optimal partitioning that achieves balanced workloads among
processors, to minimize the resulting communication, and to optimize the data movement cost.

3.2 Parallel Mesh Repartitioning 

PLUM can use any general-purpose partitioner to rebalance processor workloads after a mesh adaptation.
In [2], PMeTiS [11] and DMeTiS [17] were used. Both partitioners are parallelized and highly optimized for
maximum efficiency, and have proven effective for adaptive grids. DMeTiS is a diffusive scheme designed
to modify existing partitions, while PMeTiS is a global from-scratch partitioner that makes no assumptions
on how the mesh is initially distributed. Both are multilevel algorithms that operate in three phases: (i) a
coarsening phase, where the original mesh is reduced by collapsing adjacent vertices to a sufficiently small
mesh; (ii) a partitioning phase, where the coarsened mesh workload is balanced among the processors and
the edge cut size is minimized; and (iii) a projection phase, where the partitioned mesh is gradually restored
to its original size.

DMeTiS and PMeTiS differ mainly in how they perform the partitioning phase. DMeTiS uses a di- 
rected 2-norm minimization algorithm [10] which provides a global picture of the existing mesh. Vertices 
in heavily-loaded partitions that are adjacent to neighbors in more lightly-loaded partitions are randomly 
visited. The diffusion process computes a flow value for possible reassignment to neighboring partitions. If 
the flow value relative to the vertex weight is high, the vertex is reassigned. This process continues until the 
partition is balanced or no further progress can be made. If a balanced partitioning cannot be achieved at 
the current level of the mesh, it is projected to the next finer level and the partitioning process is repeated.
PMeTiS, on the other hand, utilizes a greedy recursive bisection algorithm to create a partition of the graph 
from scratch. The time complexity for both algorithms is minimal since the partitioning is performed on a 
coarse graph containing a small number of vertices and edges. 

3.3 Processor Remapping 

The goal of processor reassignment is to find a mapping between partitions and processors that minimizes
the cost of data redistribution. To achieve this, PLUM computes a similarity matrix S, where entry $S_{ij}$ is
the sum of the $\mathrm{Remap}^v$ values of all vertices in the new partition j that already reside on processor i. Various
cost functions [16] are usually needed to solve the reassignment problem using S for different machine
architectures. In [2], an efficient heuristic algorithm was developed to minimize the volume of data that
is moved among the processors. This algorithm has been shown to be no worse than twice the optimal
performance.
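
As an illustration, the similarity matrix can be built directly from its definition. The greedy assignment that follows is only a simple stand-in chosen for this sketch (largest entries first); it is not claimed to be the heuristic of [2], whose details are not given here. All function and parameter names are hypothetical.

```python
def similarity_matrix(num_procs, resident, new_part, remap):
    """S[i][j]: total Remap weight of vertices in new partition j
    whose data already resides on processor i."""
    S = [[0] * num_procs for _ in range(num_procs)]
    for v, cost in remap.items():
        S[resident[v]][new_part[v]] += cost
    return S


def greedy_assignment(S):
    """Map each partition j to a processor i, taking large entries first.

    Illustrative stand-in only; returns a dict partition -> processor.
    """
    n = len(S)
    entries = sorted(((S[i][j], i, j) for i in range(n) for j in range(n)),
                     key=lambda e: (-e[0], e[1], e[2]))
    proc_of, used = {}, set()
    for _, i, j in entries:
        if i not in used and j not in proc_of:
            proc_of[j] = i
            used.add(i)
    return proc_of
```

Assigning partition j to the processor that already holds most of its data keeps that data in place, which is the intuition behind using S to reduce the volume of data moved.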

3.4 Cost Model 

Predicting the expected redistribution overhead is difficult because of the large number and complexity of the
costs involved. For example, it includes the cost for rebuilding internal data structures and updating shared
boundary information. Furthermore, the total redistribution cost depends on the architecture and on the
many-to-many communication patterns used by the remapper. In PLUM, the equation $\gamma \times \mathrm{MaxSR} + O$ is used
to model the total cost [2, 16]. Here, $\gamma$ represents the computation and communication overhead to process
each redistributed element, MaxSR is the maximum number of elements sent and received by any processor,
and $O$ is the predicted sum of all other constant overheads such as data compaction, communication latency,
and barrier synchronization. A least squares fit can be used to approximate $\gamma$ and $O$ for various architectures,
while MaxSR is computed from the similarity matrix S.

Once the redistribution cost is computed, it can be compared with the expected computational gain
achieved by reducing the load imbalance among the processors. If the computational gain is larger than the 
redistribution cost, the new partitioning and mapping are accepted. Otherwise, the computation is resumed 
on the unbalanced mesh. 
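
The accept/reject decision can be written down in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the values of the per-element overhead `gamma`, the constant `overhead`, and the expected gain would come from a machine-specific fit and from the application, neither of which is reproduced here.

```python
def redistribution_cost(gamma, max_sr, overhead):
    """Modeled total redistribution cost: gamma * MaxSR + O."""
    return gamma * max_sr + overhead


def accept_remapping(computational_gain, gamma, max_sr, overhead):
    """Accept the new partitioning only if the expected gain is larger
    than the modeled redistribution cost."""
    return computational_gain > redistribution_cost(gamma, max_sr, overhead)
```

With made-up values `gamma = 2.0`, `max_sr = 100`, and `overhead = 50.0`, the modeled cost is 250.0, so a gain of 300 leads to remapping while a gain of 250 does not.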

3.5 Differences with SBN-Based Load Balancer 

The SBN load balancing algorithm differs from PLUM in several ways. Here we itemize the salient differ- 
ences: 

• Processing is temporarily halted under PLUM while the load is balanced. During the suspension, a 
new partitioning is generated and data is redistributed among the processors. The SBN approach, on 
the other hand, allows processing to continue asynchronously with load balancing. This feature allows 
the possibility of utilizing latency-tolerant techniques to hide communication and redistribution costs 
during processing. 

• With PLUM, the suspension of processing and subsequent repartitioning does not guarantee an im- 
provement in the quality of load balance. If it is determined that the estimated remapping cost exceeds 
the expected computational gain, processing continues using the original mesh assignment. This could 
result in unnecessary idle time. In contrast, the SBN approach, when active, always reduces the
execution time for the application.

• PLUM redistributes all necessary data to the appropriate processors before processing is restarted.
SBN, however, distributes work in a lazy manner, i.e., data is migrated to a processor only when it is 
ready to process the data. In this way, some of the redistribution and communication overhead can be 
avoided. 


4 Experimental Study 

The SBN-based load balancing algorithm has been implemented using MPI on the wide-node IBM SP2 
located at NASA Ames Research Center, and tested with actual workloads obtained from an adaptive 
unstructured-grid calculation. 


4.1 Test Case 

The computational mesh used for the experiments reported in this paper simulates an unsteady environment
where the adapted region is strongly time-dependent. This goal is achieved by propagating a simulated shock
wave through the initial mesh as shown in Fig. 3. The test case is generated by refining all elements within
a cylindrical volume moving left to right across the domain, while coarsening previously-refined elements
in its wake. Performance is measured at nine successive adaptation levels, during which the weighted sum
of the vertices increased from 50,000 to 1,833,730. The levels shown in Tables 3 and 4 indicate successive
positions of the shock wave as it progresses through the cylindrical volume. This test case was chosen so
that results could be compared with those compiled in [2] under the PLUM environment.
Figure 3: Initial and adapted meshes (after levels 1 and 5) for the simulated unsteady experiment.


4.2 Performance Metrics 

The following metrics were chosen to evaluate the effectiveness of the SBN-based load balancer when 
processing an unsteady adaptive mesh. Recall that v denotes a vertex to be processed and P is the total 
number of processors. 

• Cut percentage: The runtime interaction between adjacent vertices residing on different processors 
is represented by this metric as: 


$$\mathrm{Cut\%} = 100 \times \frac{\sum_{p \in P} \; \sum_{v \text{ assigned to } p} \mathrm{Comm}_p^v}{\sum_{e} \mathrm{Comm}_e},$$

where $\mathrm{Comm}_e$ is the weight of edge e in the adaptive mesh. The Cut% value should be as small as
possible. The PrePartCut% (see Table 3) is the projection of the mesh edge cut before running the SBN
pre-partitioner. On the other hand, PreExecCut% computes the mesh edge cut immediately before
processing a mesh adaptation level, while PostExecCut% is the actual cut realized after processing
the given adaptation level.

• Maximum redistribution cost: The goal of this metric is to capture the total cost of packing and
unpacking data, separated by a barrier synchronization. Since a processor can either be sending or
receiving data, the overhead of these two phases is modeled as a sum of two costs as:

$$\mathrm{MaxSR} = \max_{p \in P} \Big\{ \sum_{v \text{ sent from } p} \mathrm{Remap}_p^v \Big\} + \max_{p \in P} \Big\{ \sum_{v \text{ recv by } p} \mathrm{Remap}_p^v \Big\}.$$

Since MaxSR pertains to the processor that incurs the maximum remapping cost, a reduction in the
total data redistribution overhead can be guaranteed by minimizing MaxSR.

• Load imbalance factor: This metric is the ratio of the work on the most heavily-loaded processor to
the average load across all processors, and is formulated as:

$$\mathrm{LoadImb} = \max_{p \in P} \mathrm{QWgt}(p) \,/\, \mathrm{WSysLL}.$$

The LoadImb factor should be as close to unity as possible.
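The three metrics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and input layouts (an edge list, an ownership map, a list of vertex moves) are hypothetical, and the cut percentage here counts each cut edge once, which may differ from the per-vertex $\mathrm{Comm}_p^v$ accounting defined earlier in the paper.

```python
from collections import defaultdict


def cut_percentage(edges, owner):
    """Share of total edge weight on edges whose endpoints sit on different processors.

    Assumption: each cut edge is counted once (not once per endpoint).
    edges: iterable of (u, v, weight); owner: vertex -> processor id.
    """
    total = sum(w for _, _, w in edges)
    cut = sum(w for u, v, w in edges if owner[u] != owner[v])
    return 100.0 * cut / total


def max_sr(moves, remap_cost):
    """MaxSR: largest per-processor send cost plus largest per-processor receive cost.

    moves: iterable of (vertex, source processor, destination processor).
    """
    sent = defaultdict(float)
    recv = defaultdict(float)
    for v, src, dst in moves:
        sent[src] += remap_cost[v]
        recv[dst] += remap_cost[v]
    return max(sent.values(), default=0.0) + max(recv.values(), default=0.0)


def load_imbalance(qwgt):
    """LoadImb: heaviest processor load divided by the average load."""
    loads = list(qwgt.values())
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    edges = [("a", "b", 2.0), ("b", "c", 1.0), ("c", "d", 1.0)]
    owner = {"a": 0, "b": 0, "c": 1, "d": 1}
    print(cut_percentage(edges, owner))                       # one cut edge of weight 1 out of 4
    print(max_sr([("b", 0, 1), ("c", 1, 2)], {"b": 3.0, "c": 1.0}))
    print(load_imbalance({0: 10.0, 1: 10.0, 2: 10.0, 3: 14.0}))
```

Note that MaxSR takes the two maxima independently, so the sending and receiving maxima may come from different processors.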


4.3 Summary of Results 

Table 3 presents performance results of processing the adaptive mesh using the SBN-based load balancer
with and without the SBN pre-partitioner running between adaptations. Table 4 charts the results achieved
using the PMeTiS and DMeTiS partitioners within the PLUM environment. Note that Table 4 does not
contain results corresponding to all the processor sets shown in Table 3. We have included only those values
that were available to us.

The LoadImb factors are not shown in Table 3 since they were consistently between [...],
indicating that the quality of load balance with the SBN-based approach was extremely high. In contrast, this
factor was respectively 1.04 and 1.59 for P = 32 using PMeTiS and DMeTiS under the PLUM environment
(see Table 4). LoadImb is poorer for DMeTiS, as expected from its diffusive nature.

Results show that the SBN PostExecCut%, when using the pre-partitioner, is roughly double that
reported by PMeTiS (21.29 in Table 3 vs. 10.94 in Table 4, for P = 32). The difference is
almost negligible when compared to the results obtained with DMeTiS (20.22 in Table 4). This could reflect
the effectiveness of the partitioners being used rather than whether the SBN-based load balancer would
always produce higher communication costs. Note that PostExecCut% is about 1.5 times higher when the
SBN pre-partitioner is not active (see Table 3). This implies that it may be useful to initially partition the
mesh to compute a starting point for subsequent SBN load balancing when high communication cost is a
critical factor.

The MaxSR metric is proportional to the redistribution cost incurred while processing the adaptive mesh.
The SBN lazy approach to migration of vertex data sets produces significantly lower values than those
achieved by PMeTiS or DMeTiS under PLUM. For example, when P = 32, Table 3 shows MaxSR = 2[...]
without the SBN pre-partitioner, which is significantly less than the corresponding values in Table 4 (63,270
for PMeTiS and 62,542 for DMeTiS). However, when the SBN pre-partitioner is used with the load balancer,
the MaxSR value increases (see Table 3). Thus, there is a trade-off here: the pre-partitioner reduces runtime
interprocessor communication at the expense of a higher data redistribution cost. Finally, by comparing
PrePartCut% and PreExecCut% in Table 3, observe that Cut% degrades as the pre-partitioner executes.
This result is consistent with the observations drawn from the PLUM experiments [2].

In conclusion, our experiments demonstrate that using the SBN pre-partitioner produces lower communication
costs but higher data remapping costs. Although the pre-partitioner may be of limited value for
those adaptive mesh applications where remapping costs dominate communication costs, it could be useful
in scenarios where reducing the communication cost is more important. Overall, these performance results
demonstrate that the proposed SBN-based dynamic load balancer is effective for processing adaptive mesh
problems by providing a global workload view across processors. In many mesh applications where the cost
of data redistribution dominates the cost of communication and processing, the SBN-based algorithm would
be preferred.


4.4 Complexity Analysis 

In this section, we analyze the overhead associated with the execution of the SBN-based load balancer
while processing the adaptive computational mesh. The overhead has four components: (i) selecting the
next vertex to be processed; (ii) selecting the set of vertices to be migrated; (iii) processing to determine whether

Table 3: Performance Results using the SBN-Based Load Balancer with (without) Pre-Partitioning

Level     PrePartCut%   PreExecCut%   PostExecCut%   MaxSR
1         …             … (0.09)      0.95 (4.64)      9,606   (6,974)
2         …             … (3.14)      1.60 (6.18)     41,926  (30,538)
3         …             … (5.36)      2.60 (6.08)    178,631  (57,724)
4         …             … (3.93)      2.31 (3.86)    118,679  (20,646)
5         …             … (2.91)      2.39 (5.32)    112,437  (76,893)
6         …             … (2.33)      2.93 (4.62)     87,517 (103,544)
7         …             … (2.23)      1.78 (5.86)     75,925 (140,904)
8         …             … (2.83)      2.08 (6.14)    223,160 (153,735)
9         …             … (3.10)      2.18 (6.89)    103,772 (129,374)
Average   …             … (2.88)      2.09 (5.51)    105,739  (80,037)











[Table 4, P = 32, PMeTiS (DMeTiS): surviving entries]

Level   PreExecCut%   Average   7 18

(4.6…)  (19.26)  (21.14)  (17.13)  (29.08)  (25.31)  (20.55)  (10.04)  (9.41)  (17.40)

(20.50)  (25.26)  (28.21)  (26.46)  (24.38)  (14.17)  (13.08)  (14.18)  (20.22)

16,758  39,565  73,074  92,581  82,751  88,642  91,301  79,662

(17,393)  (44,413)  (99,232)  (97,280)  (86,204)  (78,312)  (72,474)  (62,522)

(1.88)  (2.12)  (2.12)  (1.87)  (1.68)  (1.41)  (1.11)  (1.05)  (1.05)

63,270 (62,542)


load balancing is necessary; and (iv) load balancing and job distribution messages (communication) among
processors. Where possible, both analytical formulas and experimental data are presented.

The vertex $v$ to be processed next is selected using a priority min-queue. Let $V_p$ be the set of vertices
to be processed at a given processor $p$, and $E_p$ be the set of all internal [...] to the vertices in $V_p$.
Heap operations like create and insert/delete-min require $O(V_p)$ and $O(\log V_p)$,
respectively. The (non-standard) removal operation can be implemented in $O(\log V_p)$, provided a direct
pointer to the entry to be removed is maintained. However, for SBN processing, the sum
$(\mathrm{Wgt}^v + \mathrm{Comm}_p^v + \mathrm{Remap}_p^v)$ must be computed so that the value of QWgt(p) can be
obtained (see Section 2.2.4). Also, $(\mathrm{Comm}_p^v + \mathrm{Remap}_p^v)/\mathrm{Wgt}^v$ is needed to correctly
control the ordering of the priority min-queue (see Section 2.2.2). Each
of these calculations requires $O(\delta_v)$ time, where $\delta_v$ is the degree of $v$. Therefore, the SBN priority
min-queue (heap) creation requires $O(V_p + \sum_{v \in V_p} \delta_v) = O(V_p + E_p)$ time. Similarly, each heap
insertion, delete-min, and removal operation completes in $O(\log V_p + \delta_v)$ time.
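The logarithmic-time removal mentioned above depends on knowing where an entry sits in the heap. A minimal sketch of one common way to do this, assuming a Python dictionary serves as the "direct pointer" (the class and method names are hypothetical and not taken from the SBN implementation):

```python
class IndexedMinHeap:
    """Binary min-heap with a position map, so any entry can be removed in O(log n)."""

    def __init__(self, items=()):
        # items: iterable of (key, vertex) pairs; bottom-up heapify runs in O(n).
        self._heap = list(items)
        self._pos = {v: i for i, (_, v) in enumerate(self._heap)}
        for i in reversed(range(len(self._heap) // 2)):
            self._sift_down(i)

    def __len__(self):
        return len(self._heap)

    def insert(self, key, vertex):
        self._heap.append((key, vertex))
        self._pos[vertex] = len(self._heap) - 1
        self._sift_up(len(self._heap) - 1)

    def delete_min(self):
        return self.remove(self._heap[0][1])

    def remove(self, vertex):
        # Swap the target with the last entry, drop it, then restore heap order.
        i = self._pos[vertex]
        last = len(self._heap) - 1
        self._swap(i, last)
        entry = self._heap.pop()
        del self._pos[vertex]
        if i < last:
            self._sift_up(i)
            self._sift_down(i)
        return entry

    def _swap(self, i, j):
        h = self._heap
        h[i], h[j] = h[j], h[i]
        self._pos[h[i][1]] = i
        self._pos[h[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._heap[i][0] >= self._heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self._heap)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self._heap[c][0] < self._heap[smallest][0]:
                    smallest = c
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest
```

In the setting of this section the key would be the ratio $(\mathrm{Comm}_p^v + \mathrm{Remap}_p^v)/\mathrm{Wgt}^v$; computing that key is the $O(\delta_v)$ part of each operation and is left to the caller here.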

SBN vertex migration involves first selecting a random set, $R$, of vertices from the local queue.
The vertex $v' \in R$ with the smallest $\Delta\mathrm{Cut}(v')$ value is chosen for migration (see Section 2.2.3). Each
$\Delta\mathrm{Cut}(r)$ calculation completes in $O(\delta_r)$ time, where $r \in R$. Therefore, the total time required to
select the initial vertex for migration is $O(\sum_{r \in R} \delta_r) \approx O(|R| \times \delta_{avg})$, where $\delta_{avg}$ is the
average degree of a vertex in the mesh. Next, the local queue is searched in a breadth-first manner to choose
an additional set, $V_{migs}$, of vertices for migration with $v'$. In our experiments, $|V_{migs}|$ averaged less than
ten to satisfy the requirements of a load balancing operation. Furthermore, a single search almost always
found enough vertices to migrate. Thus the time required to complete the breadth-first search is
$O(|V_{migs}| + \sum_{v \in V_{migs}} \delta_v) \approx O(|V_{migs}| \times (1 + \delta_{avg}))$. Finally, each vertex to be migrated must
be removed from the priority min-queue so that it will no longer be considered for local processing. Since
$|V_{migs}| + 1$ removal operations are required, the time complexity for this step is
$O((|V_{migs}| + 1) \times \log V_p + \sum_{v \in V_{migs}} \delta_v) \approx O(|V_{migs}| \times (\log V_p + \delta_{avg}))$. Combining
the above three terms and considering that $|R|$ is a constant, the overall asymptotic time complexity for job
migration is $O(|V_{migs}| \times (\log V_p + \delta_{avg}))$.
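The selection procedure described above (random sample, seed with the smallest $\Delta\mathrm{Cut}$, breadth-first growth) can be sketched as follows. This is an illustration only: `delta_cut` is a hypothetical caller-supplied function standing in for the $\Delta\mathrm{Cut}$ of Section 2.2.3, `needed` counts the seed vertex as well, and the default sample size is an arbitrary choice, not a value from the paper.

```python
import random
from collections import deque


def select_migration_set(queue, adj, delta_cut, needed, sample_size=4, rng=random):
    """Choose a seed by smallest delta_cut among a random sample, then grow the
    set breadth-first through neighbours that are still in the local queue.

    queue: set of locally queued vertices; adj: vertex -> iterable of neighbours.
    """
    if not queue:
        return []
    # sorted() gives random.sample the sequence it requires.
    sample = rng.sample(sorted(queue), min(sample_size, len(queue)))
    seed = min(sample, key=delta_cut)
    chosen, seen, frontier = [seed], {seed}, deque([seed])
    while frontier and len(chosen) < needed:
        u = frontier.popleft()
        for w in adj.get(u, ()):
            if w in queue and w not in seen:
                seen.add(w)
                chosen.append(w)
                frontier.append(w)
                if len(chosen) == needed:
                    break
    return chosen
```

Each chosen vertex would then be removed from the priority min-queue, which is where the $\log V_p$ factor in the bound above comes from.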

Each processor must periodically check whether a load balancing operation should be initiated or if
messages from other processors need to be processed. If processors check too frequently, the associated
overhead could be too high. On the other hand, infrequent checks for load balancing activity could lead
to excessive idle time. The following analysis can be used to minimize this overhead without significantly
increasing processor idle time. If $f$ is the frequency of a processor checking for load balancing activity, the
average response time to process a message is $2/f$. Each time the SBN-based load balancer is invoked,
balancing and distribution messages pass through $3 \log P$ communication stages. Therefore, the total response
time to balance the system load is $(6 \log P)/f$. If $J_{avg}$ is the average number of jobs processed per unit
time, the MinTh threshold should be set such that load balancing will be triggered when QWgt(p) < MinTh,
to avoid excessive idle time (see Section 2.2). In other words, $\mathrm{MinTh} \geq \lceil 6 \log P \times J_{avg} / f \rceil$.
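As a quick numeric illustration of the threshold bound, the helper below evaluates the smallest integer MinTh that satisfies it. The function name and the sample values are hypothetical, and the logarithm is assumed to be base 2, which the text does not state explicitly.

```python
import math


def min_threshold(num_procs, jobs_per_unit_time, check_frequency):
    """Smallest integer MinTh with MinTh >= 6 * log2(P) * J_avg / f (base 2 assumed)."""
    stages = 6 * math.log2(num_procs)
    return math.ceil(stages * jobs_per_unit_time / check_frequency)


if __name__ == "__main__":
    # Illustrative values only: P = 32, J_avg = 50 jobs per unit time, f = 100 checks per unit time.
    print(min_threshold(32, 50, 100))
```

Raising the checking frequency lowers the threshold, at the price of more polling overhead, which is the trade-off the paragraph above describes.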

The communication overhead due to message passing is measured experimentally. Table 5 shows the
number of Mbytes that were transferred between processors during the load balancing and job distribution
phases. The data volumes are also expressed as percentages of the available bandwidth. A wide-node SP2
has a bandwidth of 36 Mbytes/sec and a latency of 40 μsecs. As expected, the cost of workload migration
is significantly larger than the cost of actually balancing the system load. An extrapolation of the results
using an exponential curve-fitting program indicates that parallel speedup will not scale past 128 processors.
Most of the overhead is due to the latency associated with transmitting many small messages; however, it
is asymptotically sublinear in the total number of processors used. Future research will investigate utilizing
latency-tolerant techniques to allow for bulk transfers.


Table 5: Communication Overhead of the SBN-Based Load Balancer

        Load Balancing Phase               Job Distribution Phase
P       Volume (MBytes)   Bandwidth (%)    Volume (MBytes)   Bandwidth (%)
2       0.342             0.00              3.919             3.67
4       0.150             0.00              7.939             7.44
8       0.463             0.01             25.397            23.79
16      0.581             0.02             30.454            28.53
32      1.550             0.12             38.244            35.83


Table 6 shows the fraction of time spent in the SBN-based load balancer compared to the total execution 
time of the mesh adaptation application. The three columns in the table correspond to three categories 
of load balancing activity: (i) time needed to handle load balancing messages, (ii) time needed to migrate 
vertices from one processor to another, and (iii) time needed to select the next vertex to be processed. Results 


Table 6: Percentage Overhead of the SBN-Based Load Balancer

P       Balancing Activity   Migration Activity   Vertex Selection
2       0.0153               0.0414               0.8490
4       0.0187               0.1069               1.1361
8       0.1245               0.1969               1.9886
16      0.6369               0.2829               2.1145
32      0.1554               0.3774               2.8543

show that processing related to the selection of vertices is the most expensive phase. The SBN algorithm
dynamically chooses the next vertex to be processed, depending on specific runtime criteria. Thus, vertex
selection is not as efficient as parallel multilevel partitioning. However, the data movement cost in the SBN
approach is substantially smaller than that of traditional remapping schemes since it allows processing to
continue while the load is dynamically balanced, thereby overlapping processing and migration. Overall,
the total overhead of our load balancer is relatively small compared to the time spent processing the mesh.


5 Summary 

In this paper, we have described a novel topology-independent approach to solving the dynamic load
balancing problem for adaptive meshes. Our thorough experimental investigation with an unstructured adaptive
mesh application showed that the proposed SBN-based load balancer achieves a lower redistribution cost
than that under the PLUM environment. This was possible by overlapping processing and data migration.
However, the communication costs using SBN were significantly higher than those reported under PLUM.
Overall, the SBN approach was demonstrated to be a viable option in load balancing dynamic irregular
applications.

The SBN-based load balancer is not purely diffusive, in that work is not necessarily migrated to neighboring
processors. In fact, a vertex is usually redistributed to a processor that owns an adjacent vertex.
While diffusive strategies are fairly common, scratch-remap techniques (similar to that used in PLUM) have
also been used successfully to load balance adaptive mesh applications. Our more recent work on the SGI
Origin2000 system is consistent with the performance results presented here, showing the portability of the
SBN-based load balancing algorithm.

Because of its latency-tolerance feature, it seems natural to evaluate the performance of the SBN
approach on a heterogeneous cluster of computers. Another research arena includes strategies to adapt the
processing to situations where some of the processors in the network become unavailable during a
computation. Such fault tolerance would allow applications to make use of resources that are constantly changing
during execution. Finally, the techniques presented here could be applied to other practical applications,
such as multimedia image processing and data mining, where load balancing is an important issue. These
will be the focus of future research.